This document is composed of 5 sections.
Please navigate using the tabs to see the different contents.
This tidy data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
The question we would like to answer is: which chemical properties influence the quality of white wines?
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Our dataset consists of 13 variables. The X variable is only a row identifier and will not be considered in the rest of this analysis, which leaves 12 meaningful variables. The dataset contains around 4,900 observations.
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.0 0.27 0.36 20.7 0.045
## 2 6.3 0.30 0.34 1.6 0.049
## 3 8.1 0.28 0.40 6.9 0.050
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## 6 8.1 0.28 0.40 6.9 0.050
## 7 6.2 0.32 0.16 7.0 0.045
## 8 7.0 0.27 0.36 20.7 0.045
## 9 6.3 0.30 0.34 1.6 0.049
## 10 8.1 0.22 0.43 1.5 0.044
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## 7 30 136 0.9949 3.18 0.47 9.6
## 8 45 170 1.0010 3.00 0.45 8.8
## 9 14 132 0.9940 3.30 0.49 9.5
## 10 28 129 0.9938 3.22 0.45 11.0
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
## 7 6
## 8 6
## 9 6
## 10 6
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
This section shows several distribution charts for each variable. Please use the tabs to navigate from one analysis to another.
The orange chart is based on the raw data. The red vertical line shows the 95% quantile threshold. The blue chart is based on the data without upper outliers, which are identified using the interquartile range (IQR) method. The grey chart shows, when relevant, the data without outliers on a log10 scale. Associated descriptive statistics are provided when relevant.
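The two thresholds used in the charts can be sketched as follows. This is only a minimal illustration, not the code behind the charts: the helper name `upper_fence` and the 1.5 multiplier (the usual convention for the interquartile rule) are assumptions, and the sample vector is simply the first ten `residual.sugar` values shown in the data preview.

```r
# Hypothetical helper: upper outlier threshold from the interquartile rule.
# The multiplier k = 1.5 is the usual convention and is an assumption here.
upper_fence <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  q[2] + k * (q[2] - q[1])
}

# First ten residual.sugar values from the data preview
sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5)

# 95% quantile (red vertical line) and upper fence (cut-off for the blue chart)
q95 <- quantile(sugar, probs = 0.95, names = FALSE)
fence <- upper_fence(sugar)
sugar_no_outliers <- sugar[sugar <= fence]

# The same rule applied to the published quartiles (Q1 = 1.7, Q3 = 9.9)
fence_from_summary <- 9.9 + 1.5 * (9.9 - 1.7)
```

Applied to the published quartiles, the rule gives a threshold close to the outlier limits quoted in the per-variable comments.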
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The fixed acidity distribution is rather normal, with an average value of 6.855 g/dm^3. There are some outliers with values beyond 10 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The volatile acidity distribution is rather normal, with an average value of 0.2782 g/dm^3. There are some outliers with values greater than 0.5 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
The citric acid distribution is rather normal, with an average value of 0.3342 g/dm^3. There are some outliers with values above 0.6 g/dm^3.
We see a peak just below 0.5 g/dm^3. As it sits just below 0.5, it would be interesting to understand how the associated measurements were made and whether we are seeing a limit of the measurement system.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The residual sugar distribution is skewed, with an average value of 6.391 g/dm^3. There are some outliers with values above 22.5 g/dm^3.
Looking at the log10 transformed distribution, we see a distribution that is at least bimodal. Looking only at sugar values above 5 g/dm^3, the distribution also appears multimodal.
It could be interesting to analyse separately the wines with residual sugar below and above 5 g/dm^3.
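A minimal sketch of such a split is shown below, using the first ten `residual.sugar` values from the data preview. The group labels, and the choice to put a value of exactly 5 g/dm^3 in the upper group, are assumptions made for illustration.

```r
# First ten residual.sugar values from the data preview
sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5)

# log10 transformation, as used for the grey chart
log_sugar <- log10(sugar)

# Split at 5 g/dm^3 (a value of exactly 5 goes to the upper group: an assumption)
sugar_group <- ifelse(sugar < 5, "below 5", "5 or above")

# Number of wines and mean residual sugar in each group
table(sugar_group)
tapply(sugar, sugar_group, mean)
```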
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Chlorides distribution is rather normal with an average value of 0.04577 g/dm^3. There are some outliers with values above 0.07 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Free sulfur dioxide distribution is rather normal with an average value of 35.31 mg/dm^3. There are some outliers with values above 80 mg/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The total sulfur dioxide distribution is rather normal, with an average value of 138.4 mg/dm^3. There are some outliers with values above about 255 mg/dm^3 (the upper interquartile threshold, 167 + 1.5 × 59).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Wine density distribution is rather normal with an average value of 0.9940 g/cm^3. There are some outliers with values above 1.0025 g/cm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH distribution is rather normal with an average value of 3.188. There are some outliers with values above 3.55.
It means that Vinho Verde is a fairly acidic wine. This is consistent with the acidity of grapefruit juice (see https://en.wikipedia.org/wiki/PH#/media/File:216_pH_Scale-01.jpg).
Reminder: the pH scale goes from 0 to 14. Neutral pH is 7. Values below 7 are acidic. Values above 7 are basic.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Sulfates distribution is rather normal with an average value of 0.4898 g/dm^3. There are some outliers with values above 0.76 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Looking at the alcohol distribution, we can see that it is skewed, with an average alcohol level of 10.51% by volume. This is a fairly light wine (on average, wines have an alcohol level of 12% to 14%). There are no real outliers when using the interquartile method.
The log10 transformation does not reveal any multimodal distribution.
We can identify 3 different quality groups: the low ones with scores up to 4, the medium ones with scores of 5, 6 or 7, and the good ones with scores of 8 or 9.
Most of the wines are considered to be of medium quality. There are very few wines considered bad (quality = 3) and even fewer rated very good (quality = 9). We have no “crappy” wines (quality = 0 or 1) nor outstanding wines (quality = 10).
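The three groups can be derived from the quality score as sketched below. The `quality` vector holds hypothetical scores used only for illustration, and the labels are assumptions. The breaks follow the grouping described above.

```r
# Hypothetical quality scores, for illustration only
quality <- c(3, 4, 5, 6, 6, 7, 8, 9)

# Intervals are right-closed: (-Inf, 4], (4, 7], (7, Inf)
quality_group <- cut(quality,
                     breaks = c(-Inf, 4, 7, Inf),
                     labels = c("low", "medium", "good"))

# Number of wines in each group
table(quality_group)
```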
From the dataset, we will exclude all rows having at least one value considered an outlier (based on the interquartile method).
## [1] "Initial number of rows: 4898"
## [1] "Number of rows in cleaned dataset: 4074"
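A row-wise cleaning step of this kind can be sketched as follows. This is not the original code: the helper `is_outlier`, the 1.5 multiplier, and the choice to flag values on both sides of the interquartile range are assumptions, and the `toy` data frame contains made-up values.

```r
# Hypothetical helper: TRUE where a value lies outside the interquartile fences.
# Flagging both low and high values is an assumption.
is_outlier <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Toy data frame standing in for the wine data (made-up values)
toy <- data.frame(
  alcohol   = c(9.5, 10.1, 9.9, 10.4, 9.8, 10.0),
  chlorides = c(0.045, 0.049, 0.050, 0.300, 0.047, 0.046)
)

# Flag outliers column by column, then keep only rows without any flag
flags <- sapply(toy, is_outlier)
cleaned <- toy[rowSums(flags) == 0, , drop = FALSE]

print(paste("Initial number of rows:", nrow(toy)))
print(paste("Number of rows in cleaned dataset:", nrow(cleaned)))
```

Because a row is dropped as soon as one variable is flagged, the number of removed rows grows with the number of variables checked.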
Please navigate using the tabs to see the two different correlation plots.
Quality seems to be negatively correlated with density and positively correlated with alcohol. However, the correlation coefficients involving quality are pretty low in absolute value.
The high correlations are:
Quality is partially correlated with alcohol, density and chlorides.
Nevertheless, the point clouds are often widely dispersed, which explains why the correlation factors are often below 0.5 in absolute value.
In the rest of this document, only absolute values of correlation factors will be mentioned.
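The absolute correlation factors can be computed along the following lines. This is a sketch under assumptions: the `toy` data frame holds made-up values, and Pearson correlation (the default of `cor`) is assumed, since the method used is not stated.

```r
# Toy data (made-up values) standing in for the cleaned wine data
toy <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 11.0, 12.2, 12.8),
  density = c(1.0010, 0.9980, 0.9951, 0.9938, 0.9915, 0.9900),
  quality = c(5, 5, 6, 6, 7, 8)
)

predictors <- setdiff(names(toy), "quality")

# Pearson correlation of each variable with quality, kept in absolute value
abs_cor <- sapply(predictors, function(v) abs(cor(toy[[v]], toy$quality)))

# Strongest relationships first
sort(abs_cor, decreasing = TRUE)
```

Taking the absolute value hides the direction of the relationship, so the sign has to be read from the charts.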
In this section, we will analyse the correlations between quality and all other variables. Please use the tabs to navigate from one analysis to another.
## [1] "Correlation factor: 0.0524192864513668"
Correlation factor is 0.05. Whatever the quality, the average values are pretty much the same (between 6.6 and 7 g/dm^3). We see a very high dispersion of values, whatever the quality.
We cannot identify any correlation pattern between fixed acidity and quality.
## [1] "Correlation factor: 0.117125850001902"
Correlation factor is 0.12. Whatever the quality, the average values are pretty much the same (between 0.25 and 0.32 g/dm^3). We see a very high dispersion of values, whatever the quality.
We cannot identify any correlation pattern between volatile acidity and quality.
## [1] "Correlation factor: 0.0358270876981967"
Correlation factor is 0.04. If we exclude the low quality wines, we see that quality tends to increase as citric acid increases.
Whatever the quality, the average values are pretty much the same (between 0.3 and 0.35 g/dm^3).
We cannot identify any correlation pattern between citric acid and quality.
## [1] "Correlation factor: 0.10565960642498"
Correlation factor is 0.11. We can see that very good quality wines have a low residual sugar quantity. The quality of medium and good wines tends to increase as residual sugar decreases. This is logical, as Vinho Verde is supposed to be a dry wine.
The dispersion of values, and of mean values, across wine qualities is much larger here. Nevertheless, we cannot identify any correlation pattern between residual sugar and quality.
## [1] "Correlation factor: 0.279538296180308"
Correlation factor is 0.28. This is one of the highest correlation factors.
We can see that for low quality wines, quality increases when the chlorides quantity is higher. For medium and good quality wines, we have the opposite trend.
Nevertheless, due to the high dispersion of values, we cannot identify any correlation pattern between chlorides and quality.
## [1] "Correlation factor: 0.0170360242390806"
Correlation factor is 0.02. This is very low.
Here, we can see an interesting piece of information. Low quality wines tend to have a low free sulfur dioxide level. Sulfur dioxide is used in wine to prevent oxidation and other chemical reactions. We can suppose that these kinds of reactions occurred in wines with low free sulfur dioxide levels. Nevertheless, free sulfur dioxide values are very dispersed whatever the wine quality. Therefore, we cannot identify any correlation pattern between free sulfur dioxide and quality.
## [1] "Correlation factor: 0.165039356154331"
Correlation factor between these two variables is pretty low (0.17).
We can see a distinction between low quality wines and medium/good ones.
For good/medium quality wines, the lower the total sulfur dioxide is, the better the quality. For low quality wines, the mean total sulfur dioxide values are the lowest (about 110 mg/dm^3).
Nevertheless, the large dispersion of the data does not allow us to draw any conclusion about correlation.
## [1] "Correlation factor: 0.298326809705554"
The correlation factor between density and quality is one of the highest, with a value of 0.30. For good/medium quality wines, the lower the density is, the better the quality. As density and residual sugar are highly correlated, this behaviour is not a surprise.
Nevertheless, the large dispersion of the data does not allow us to draw any conclusion about correlation.
## [1] "Correlation factor: 0.0791526348942642"
Correlation coefficient between quality and pH is 0.08.
pH values are very similar whatever the wine quality. There is also a high dispersion of values for all wine qualities. This does not allow us to draw any conclusion about correlation.
## [1] "Correlation factor: 0.0181793067982046"
Here also, the correlation factor is pretty low, with a value of 0.02. Average values are very close whatever the quality. There is also a high dispersion of values for all wine qualities. This does not allow us to draw any conclusion about correlation.
## [1] "Correlation factor: 0.422835777947479"
Correlation factor between alcohol and quality is one of the highest, with a value of 0.42.
For medium and good wines, we see that quality increases with alcohol content. This is consistent with the residual sugar values: a lower sugar value means that more sugar has been converted into alcohol. This correlation is definitely not a surprise.
In every correlation analysis, the high dispersion of values means that any possible correlation cannot really be assessed.
In this section, we explore the two highest correlations between variables. Please use the tabs to navigate from one chart to another.
We can see there is a correlation between wine density and alcohol.
We can also see a correlation trend between density and residual sugar, as well as a high dispersion of density for low residual sugar values.
From the two charts above, we can see that the more residual sugar there is, the higher the density and the lower the alcohol. This result is not a surprise, as Vinho Verde white wine is supposed to be a dry wine.
In this section, we will analyse the correlation of alcohol, density and chlorides with quality. We will look at these variables two at a time against quality, which gives 3 different charts. With these charts, we will try to identify specific quality clusters and trends depending on the two variables being observed.
Please use the tabs to navigate from one chart to another.
On these 3 charts, we cannot see any data cluster standing out. In addition, the various trend lines are pretty similar. We are not able to create a model based on two variables to estimate the wine quality.
We would need a model that takes more variables into account to try to estimate the wine quality. Nevertheless, we will explain in the reflection section why we are not confident in such an approach.
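For reference, a model with more variables could be sketched as below. No such model was fitted in this analysis: the data are simulated, the coefficients used to generate them are arbitrary, and a plain linear regression with `lm` is only one possible choice.

```r
# Simulated data for illustration only (not the wine data)
set.seed(42)
n <- 200
sim <- data.frame(
  alcohol   = runif(n, 8, 14),
  density   = runif(n, 0.987, 1.003),
  chlorides = runif(n, 0.01, 0.07)
)

# Arbitrary relationship plus noise, chosen only so the example runs
sim$quality <- 3 + 0.3 * sim$alcohol + rnorm(n, sd = 0.8)

# Linear model using several variables at once
fit <- lm(quality ~ alcohol + density + chlorides, data = sim)

# Share of the variance in quality explained by the model
r_squared <- summary(fit)$r.squared
```

On the real data, a low `r_squared` would be in line with the weak correlations reported above.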
This section shows 3 different charts and the main outcome of each of them.
In this chart, we see two groups of wine: those having more than 5 g/dm^3 of residual sugar and the others. As Vinho Verde is a dry wine, we expect it to have a low residual sugar value.
In this chart, we can see that some wine growers do not fully master the wine making process. They did not include enough sulfur dioxide to prevent the wine from oxidising, which results in a poor quality wine.
In this chart, we cannot see any wine quality cluster standing out. The trend lines are pretty much the same (except for the highest quality (9), which is not really representative due to the low number of measurements).
This analysis of wine chemistry variables is really interesting. The main expected outcomes can be seen through data analysis (the link between residual sugar, density and alcohol). We can also point out some issues in the wine making process by identifying that low sulfur dioxide values lead to poor wine quality.
For quality prediction, we see that no easy correlation can be found. The highest correlation factors are very low (the highest one is less than 0.5).
We can ask ourselves about the dataset itself: are we missing some valuable information needed to better assess wine quality? For instance, we do not know the wine vintage. We also do not know the grape variety used for each of the tested wines, nor the quality of the land on which the grapes grew.
In addition, if wine quality could be determined only by its chemical factors, we could easily create artificial wines. This is not the case. I could only find one startup aiming to create artificial wine: Ava Winery (http://www.avawinery.com). This fake wine does not seem to be really good when tasted (http://www.ouest-france.fr/leditiondusoir/data/747/reader/reader.html#!preferred/1/package/747/pub/748/page/5 (article in French)).
See https://www.newscientist.com/article/2088322-synthetic-wine-made-without-grapes-claims-to-mimic-fine-vintages/ and http://www.radionz.co.nz/national/programmes/thiswayup/audio/201807662/lab-made-wine
Nevertheless, artificial wines have been banned from sale in France since 1905 under anti-fraud law.
In these conditions, I think nothing will be better than real wine tasting to find out whether you like a wine or not.
This wine analysis is very interesting. The dataset was already of very good quality and had enough observations to be representative. Even after removing the outliers, we still have enough observations. The dataset was tidy, which made all computations pretty fast.
Except for the quality field, the dataset contains only continuous variables, which makes it easier to factor the R code.
The P4 lessons and project sample were useful for building this analysis, in terms of techniques (R principles and syntax) but also in terms of how to organise the analysis.
I had some trouble making R Markdown work, especially with list items. I struggled to get the layout I wanted, but I finally succeeded.
I found interesting graphical data representations through Google (especially the correlation matrix).
I did not respect the R coding rules at the beginning, and I modified the code for the second submission. I find some coding recommendations useful, but others seem to come from another age, especially the 80 character limit, as we have not used VT terminals for ages.
If we focus on the analysis itself, we were asked to answer a specific question. Within my analysis, I could not really answer this question. I was able to identify one item explaining why you get a bad wine, but could not find items explaining why you get a good wine. In this analysis, I only explored univariate and bivariate analyses. It means that there are a lot of possibilities I did not explore (from 4 variables up to 12 variables).
To be rigorous and to assert that we cannot find a correlation between wine parameters and quality, I should have done all the possible analyses. It would have been a time consuming activity. The literature reinforced my belief that I would not find any factor that fully explains wine quality. Nevertheless, this is only an assumption that I did not really demonstrate. It would have been much easier to find a correlation between parameters that explains the wine quality. In our case, we need to be very careful when writing our conclusion in order to make it indisputable.